Quantifying substantial carcinogenesis of genetic and environmental factors from measurement error in the number of stem cell divisions

Background: The relative contributions of genetic and environmental factors versus unavoidable stochastic risk factors to the variation in cancer risk among tissues have become a widely discussed topic. Some claim that the stochastic effects of DNA replication are mainly responsible, whereas others believe that cancer risk is heavily affected by environmental and hereditary factors. Some of these studies drew their evidence from correlation analysis between the lifetime number of stem cell divisions within each tissue and tissue-specific lifetime cancer risk. However, they did not consider the measurement error in the estimated number of stem cell divisions, which is caused by exposure to different levels of genetic and environmental factors. Ignoring this error obscures the true contribution of environmental and inherited factors.

Methods: In this study, we proposed two distinct modelling strategies that integrate the measurement error model with the prevailing model of carcinogenesis to quantitatively evaluate the contribution of hereditary and environmental factors to cancer development. We then applied the proposed strategies to cancer data from 423 registries in 68 countries (global), 125 registries across China (national, China), and 139 counties in Shandong province (provincial, Shandong, China).

Results: The results suggest that genetic and environmental factors contribute at least 92% of the variation in cancer risk among 17 tissues. Moreover, mutations occurring in progenitor cells and differentiated cells are less likely to accumulate sufficiently for cancer to occur, and carcinogenesis is more likely to originate from stem cells. Except for medulloblastoma, the contribution of genetic and environmental factors to the risk of each of the other 16 organ-specific cancers is more than 60%.

Conclusions: This work provides additional evidence that genetic and environmental factors play leading roles in cancer development. Therefore, the identification of modifiable environmental and hereditary risk factors for each cancer is highly recommended, and primary prevention early in the life course should be the major focus of cancer prevention.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12885-022-10219-w.

Let s denote the total number of stem cells found in a fully developed tissue, and let d denote the total number of further divisions of each stem cell over the lifetime [1]. The LSCD0 of each tissue equals the sum of the number of divisions before the tissue is fully developed ($\sum_{n=1}^{\log_2 s} 2^{n}$) and the number of further divisions after the tissue is fully developed during the lifetime (sd) [1]. Tomasetti and Vogelstein [1] showed that this formula equals $s(2 + d) - 2$.
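As a numerical check of the calculation described above, the following minimal Python sketch sums the developmental divisions term by term and then adds the s·d later divisions. It assumes that s is an exact power of two, so that log2(s) is an integer; under that assumption the geometric series sums to 2s − 2 and the total is s(2 + d) − 2. The function names and the example values of s and d are hypothetical and are not taken from the paper.

```python
def lscd0_by_summation(s: int, d: int) -> int:
    """Sum 2**n for n = 1..log2(s), then add the s*d later divisions.

    Assumes s is an exact power of two (an illustrative simplification).
    """
    if s < 1 or s & (s - 1) != 0:
        raise ValueError("this sketch assumes s is a power of two")
    log2_s = s.bit_length() - 1
    developmental = sum(2 ** n for n in range(1, log2_s + 1))
    return developmental + s * d


def lscd0_closed_form(s: int, d: int) -> int:
    """Closed form of the same quantity: s * (2 + d) - 2."""
    return s * (2 + d) - 2


# Hypothetical values chosen only for illustration.
s_example, d_example = 2 ** 20, 50
assert lscd0_by_summation(s_example, d_example) == lscd0_closed_form(s_example, d_example)
```

For stem cell counts that are not powers of two, the closed form can still be evaluated directly, although the term-by-term sum is then only an approximation of the developmental phase.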

The lifetime cancer risk (LCR)
In this study, the LCR of 17 cancer types was calculated from global-wide (the same data as represent age interval of 0-4, 5-9, 10-14, ..., 70-74), and y_i denotes the person-years at risk in the i-th age interval. The LCR of each organ-specific cancer was then calculated through equations (1) and (2). These two databases were linked by standard administrative code, and the LCR of each organ-specific cancer was then calculated using formulas (1) and (2) [4].
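Equations (1) and (2) are not reproduced in this text. As a hedged illustration only, the sketch below uses a widely used registry approach: sum the age-specific rates over the five-year intervals 0-4 to 70-74 to obtain a cumulative rate, then convert it to a cumulative risk with 1 − exp(−rate). Whether this matches the authors' equations is an assumption. The name `cases` (number of new cases per interval) and all numbers are hypothetical; `person_years` plays the role of y_i.

```python
import math

AGE_INTERVAL_WIDTH = 5  # years per interval: 0-4, 5-9, ..., 70-74


def cumulative_rate(cases, person_years, width=AGE_INTERVAL_WIDTH):
    """Sum of width * (cases_i / y_i) over the age intervals."""
    if len(cases) != len(person_years):
        raise ValueError("cases and person_years must have the same length")
    return sum(width * c / y for c, y in zip(cases, person_years))


def cumulative_risk(cases, person_years, width=AGE_INTERVAL_WIDTH):
    """Convert the cumulative rate to a risk: 1 - exp(-rate)."""
    rate = cumulative_rate(cases, person_years, width)
    return -math.expm1(-rate)  # numerically stable 1 - exp(-rate)


# Hypothetical registry data for the 15 intervals from 0-4 to 70-74.
example_cases = [1] * 15
example_person_years = [1000.0] * 15
example_lcr = cumulative_risk(example_cases, example_person_years)
```

For small rates the risk is close to the cumulative rate itself, which is why the two quantities are often reported interchangeably for rare cancers.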

Supplementary Method 2: Simulation study of the first modelling strategy
Following the diagram in Figure 1, we conducted a simulation study to examine this modelling strategy. We considered two scenarios: LCRi follows a normal distribution (Scenario 1) or a non-normal distribution (Scenario 2).

Scenario 1: LCRi follows a normal distribution
For scenario 1, our data (the ranked LCR matrix) are simulated from the following plausibly realistic data-generating model, corresponding to Figure 1. We set the number of EHlat levels to n = 500, and for each EHlat level we considered 17 cancer types.
In the generated ranked LCR matrix, LSCD1 was generated from equation ).

Scenario 2: LCRi follows a non-normal distribution
For scenario 2, our data are simulated from the following plausibly realistic data-generating model, corresponding to Figure 1.

Supplementary Method 3: Data sources of LTCD0 and SMN0
The error-prone total number of tissue cell divisions per lifetime (from birth to age 74), LTCD0, of seven cancers (colorectal adenocarcinoma, hepatocellular carcinoma, lung adenocarcinoma, osteosarcoma, testicular germ cell carcinoma, prostate cancer and breast cancer; cellular turnover rates for the other ten tissues are not available) was calculated through ( ), where v denotes the tissue turnover rate per cell per day, obtained from the Database of Useful

The error-prone somatic mutation number SMN0 of 16 tissues (excluding the tissue of osteosarcoma) was obtained from the supplementary materials of Yizhak et al. [7]. That study applied RNA-MuTect to 6707 samples against their matched-blood DNA, spanning 29 human tissues and 488 individuals, and detected 8870 somatic mutations in 37% of the samples [7]. The maximal somatic mutation number of each tissue was used as SMN0 in our study [8].

Supplementary Method 4: Sensitivity analysis to examine the impact of screening
We performed a simulation study to examine the impact of screening on the results of our modelling strategy. Our data are simulated from the following plausibly realistic data-generating model, corresponding to Figure S2.

Skin. The body surface area of the skin increases with age. Thus, we assume s is $1.08 \times 10^{9}$ for the age groups 0-4 and 0-9. ASCD0j of melanoma in the age groups 0-14 and 0-19 was the sum of ASCD0j in 0-9 and the number of divisions after age 9 with s equal to $1.86 \times 10^{9}$; ASCD0j in the other age groups was the sum of ASCD0j in the age interval 0-19 and the number of divisions after age 19 with s equal to $2.01 \times 10^{9}$.

Supplementary Method 6: The estimation of range of basic ACR under the laboratory environment
The simplest form of the multistage model can be represented by equation (4), where ACRj denotes the cumulative risk in the j-th age interval, s denotes the total number of stem cells found in a fully developed tissue, dj denotes the total number of further divisions of each stem cell in the j-th age interval, μ denotes the somatic mutation rate, which is usually assumed to be constant across tissues, and M is the number of hits required to initiate one cancer [10,11]. In this function, μ and M are the two unknown parameters to be estimated.
Since the s and dj obtained for each organ-specific cancer were estimated under laboratory conditions, ACRj in equation (4) should be the cancer risk in the laboratory environment (ACR0j). We assume that ACR0j for each cancer type approximately equals the minimum nonzero value of ACRj of that cancer in the global and Shandong provincial ranked ACR matrices.
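To make the roles of the symbols concrete, the sketch below evaluates one simple algebraic multistage form built from independence assumptions: each of M required hits occurs in a lineage with probability 1 − (1 − mu)^dj after dj divisions, and cancer arises if at least one of s lineages carries all M hits. This form is an assumption made for illustration and is not necessarily equation (4), which is not reproduced in this text. The helper `basic_acr` follows the stated rule of taking the minimum nonzero ACRj. All function names and numeric values are hypothetical.

```python
import math


def multistage_risk(s, d_j, mu, m_hits):
    """Risk that at least one of s lineages has all M hits after d_j divisions.

    Assumes independent hits and independent lineages, with 0 < mu < 1.
    """
    # P(a given hit has occurred) = 1 - (1 - mu)**d_j, computed stably.
    p_hit = -math.expm1(d_j * math.log1p(-mu))
    p_lineage = p_hit ** m_hits
    if p_lineage >= 1.0:
        return 1.0
    # P(at least one lineage transformed) = 1 - (1 - p_lineage)**s.
    return -math.expm1(s * math.log1p(-p_lineage))


def basic_acr(acr_values):
    """Minimum nonzero ACR value, used here as a stand-in for ACR0j."""
    nonzero = [v for v in acr_values if v > 0]
    if not nonzero:
        raise ValueError("no nonzero ACR values supplied")
    return min(nonzero)


# Hypothetical parameter values, chosen only for illustration.
risk_young = multistage_risk(s=1e8, d_j=1000, mu=1e-6, m_hits=3)
risk_old = multistage_risk(s=1e8, d_j=2000, mu=1e-6, m_hits=3)
```

Using `log1p` and `expm1` avoids the loss of precision that a direct evaluation of 1 − (1 − x)^n suffers when x is very small, as is typical for mutation rates.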
Therefore, accompanied by the estimated s and dj, the ACR0j of 17 organ-specific cancers were first

Figure S5. Scatter diagram of the ranked ACR matrix and the estimated range of basic ACR for lung adenocarcinoma. The ranked ACR matrix (including the age intervals 0-4, 0-9, …, 0-74) is shown as blue dots, and the estimated range of basic ACR is shown as yellow dots. ACR: the cancer risk with respect to the specific age interval.

Additional file 2
Table S1: Ranked matrix of LCR in global scope and the contribution of the genetic and environmental factors to the variation in cancer risk